Skip to content

Conversation

@marceloneppel
Copy link
Member

@marceloneppel marceloneppel commented Sep 30, 2025

Issue

The charm is not yet supporting refresh through refresh v3.

Solution

Added refresh v3 implementation based on the developer docs and also based on the implementations from PostgreSQL VM and Kafka K8s.

Use set_unit_status to set the unit status and properly set statuses during refresh.

The old refresh/upgrade logic/files were removed.

Correctly reload pgBackRest TLS server configuration with SIGHUP to avoid interrupting the data replication between the primary and the replica (more details in the comments on this PR).

After this PR is approved, 2a806f7 must be reverted, then the PR should me merged, to properly generate the tags in the repo (this is the same as stated at canonical/postgresql-operator#866 (comment)).

Checklist

  • I have added or updated any relevant documentation.
  • I have cleaned any remaining cloud resources from my accounts.

Signed-off-by: Marcelo Henrique Neppel <[email protected]>
@codecov
Copy link

codecov bot commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 42.43902% with 118 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.98%. Comparing base (48d4eb4) to head (86860f1).
⚠️ Report is 6 commits behind head on 16/edge.

Files with missing lines Patch % Lines
src/charm.py 39.53% 64 Missing and 14 partials ⚠️
src/refresh.py 41.93% 17 Missing and 1 partial ⚠️
src/relations/async_replication.py 47.05% 8 Missing and 1 partial ⚠️
src/relations/postgresql_provider.py 10.00% 9 Missing ⚠️
src/backups.py 86.66% 1 Missing and 1 partial ⚠️
src/relations/logical_replication.py 0.00% 2 Missing ⚠️

❌ Your project check has failed because the head coverage (65.98%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           16/edge    #1115      +/-   ##
===========================================
- Coverage    67.44%   65.98%   -1.47%     
===========================================
  Files           18       18              
  Lines         3947     3904      -43     
  Branches       564      568       +4     
===========================================
- Hits          2662     2576      -86     
- Misses        1112     1143      +31     
- Partials       173      185      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@marceloneppel marceloneppel added the enhancement New feature, UI change, or workload upgrade label Oct 1, 2025
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
…check for primary on unit zero

Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
…efresh is not ready (#1156)

* Reorder call to refresh logic and don't immediatelly exit hook when refresh is not ready

Signed-off-by: Marcelo Henrique Neppel <[email protected]>

* Log more info when creating a backup and send SIGHUP when the pgBackRest is already active

Signed-off-by: Marcelo Henrique Neppel <[email protected]>

* Change log level of messages to debug

Signed-off-by: Marcelo Henrique Neppel <[email protected]>

* Update unit test

Signed-off-by: Marcelo Henrique Neppel <[email protected]>

* Fix integration test job collection

Signed-off-by: Marcelo Henrique Neppel <[email protected]>

---------

Signed-off-by: Marcelo Henrique Neppel <[email protected]>
- '**.md'
- .github/renovate.json5
- 'docs/**'
- 'terraform/**'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to match the rule we have in the VM charm.

Comment on lines +36 to +39
[pathlib.Path.home() / "go/bin/spread", "-list", "github-ci"],
capture_output=True,
check=True,
text=True,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update to list tests now that we're using charmcraft v4.

Comment on lines +824 to +825
"--log-level-file=debug",
"--log-subprocess",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More logging added (this helped to identify why the GCP PITR test was failing after refresh V3 was added in this PR: a SIGTERM appeared in the logs).

Comment on lines +1314 to +1317
logger.debug("Sending SIGHUP to pgBackRest TLS server to reload configuration")
self.container.pebble.send_signal(
signal.SIGHUP, services=[self.charm.pgbackrest_server_service]
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always restarting the service was causing a SIGTERM to be sent to the pgBackRest TLS server in the middle of the process where the replica gets the latest data from the primary to create a backup. This started to happen because of some more deferred events happening. A SIGHUP is enough to update the TLS server when the certificates change.

Comment on lines +55 to +62
# ops_test.model.deploy(
# DATABASE_APP_NAME,
# num_units=3,
# channel="16/edge",
# trust=True,
# config={"profile": "testing"},
# base=CHARM_BASE_NOBLE,
# ),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented to be restored after the first release with refresh v3 is available.

Comment on lines +50 to +57
# ops_test.model.deploy(
# DATABASE_APP_NAME,
# num_units=3,
# channel="16/stable",
# trust=True,
# config={"profile": "testing"},
# base=CHARM_BASE_NOBLE,
# ),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented to be restored after the first release with refresh v3 is available.

assert standbys == len(ops_test.model.applications[app_name].units) - 1


async def switchover_to_unit_zero(ops_test: OpsTest) -> None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used to reduce the number of switchovers to the minimum (as we expect since the previous refresh/upgrade implementation).

Comment on lines -19 to +20
UNTRUST_ERROR_MESSAGE = f"Insufficient permissions, try: `juju trust {APP_NAME} --scope=cluster`"
UNTRUST_ERROR_MESSAGE = (
f"Run `juju trust {APP_NAME} --scope=cluster`. Needed for in-place refreshes"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message changes a bit with refresh v3.

@marceloneppel marceloneppel marked this pull request as ready for review November 20, 2025 12:39
@marceloneppel marceloneppel requested review from a team, dragomirp, onkolahmet and taurus-forever and removed request for a team November 20, 2025 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature, UI change, or workload upgrade Libraries: Out of sync

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants